MIE1624H Introductory Example

US Baby Names 2010


In [1]:
%pwd


Out[1]:
u'/resources'

Load file into a DataFrame


In [2]:
import pandas as pd

names2010 = pd.read_csv('/resources/yob2010.txt', names=['name', 'sex', 'births'])
names2010


Out[2]:
name sex births
0 Isabella F 22731
1 Sophia F 20477
2 Emma F 17179
3 Olivia F 16860
4 Ava F 15300
5 Emily F 14172
6 Abigail F 14124
7 Madison F 13070
8 Chloe F 11656
9 Mia F 10541
10 Addison F 10253
11 Elizabeth F 10135
12 Ella F 9796
13 Natalie F 8715
14 Samantha F 8334
15 Alexis F 8181
16 Lily F 7900
17 Grace F 7598
18 Hailey F 6969
19 Alyssa F 6934
20 Lillian F 6898
21 Hannah F 6891
22 Avery F 6633
23 Leah F 6474
24 Nevaeh F 6345
25 Sofia F 6282
26 Ashley F 6276
27 Anna F 6242
28 Brianna F 6224
29 Sarah F 6223
... ... ... ...
33808 Zaviyon M 5
33809 Zaybrien M 5
33810 Zayshawn M 5
33811 Zayyan M 5
33812 Zeal M 5
33813 Zealan M 5
33814 Zecharia M 5
33815 Zeferino M 5
33816 Zekariah M 5
33817 Zeki M 5
33818 Zeriah M 5
33819 Zeshan M 5
33820 Zhyier M 5
33821 Zildjian M 5
33822 Zinn M 5
33823 Zishan M 5
33824 Ziven M 5
33825 Zmari M 5
33826 Zoren M 5
33827 Zuhaib M 5
33828 Zyeire M 5
33829 Zygmunt M 5
33830 Zykerion M 5
33831 Zylar M 5
33832 Zylin M 5
33833 Zymaire M 5
33834 Zyonne M 5
33835 Zyquarius M 5
33836 Zyran M 5
33837 Zzyzx M 5

33838 rows × 3 columns
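
The file has no header row, which is why column names are passed through names=. The following is a minimal sketch of that behaviour on an in-memory excerpt, assuming Python 3 and a recent pandas; the three rows are copied from the output above, and the sample and sample_df names are illustrative only.

```python
import io

import pandas as pd

# Three rows in the same headerless name,sex,births layout as yob2010.txt.
sample = io.StringIO("Isabella,F,22731\nSophia,F,20477\nJacob,M,21875\n")

# names= supplies the column labels; without it, pandas would consume
# the first data row as the header.
sample_df = pd.read_csv(sample, names=['name', 'sex', 'births'])
print(sample_df)
```

With names= given, every line of the file is kept as data and the births column is parsed as integers.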

Total number of births in 2010 by sex


In [3]:
names2010.groupby('sex').births.sum()


Out[3]:
sex
F    1759010
M    1898382
Name: births, dtype: int64

Add a prop column: each name's share of births within its sex group


In [4]:
def add_prop(group):
    # Cast to float first: under Python 2, integer division floors the result
    births = group.births.astype(float)

    group['prop'] = births / births.sum()
    return group
names2010 = names2010.groupby(['sex']).apply(add_prop)

In [5]:
names2010


Out[5]:
name sex births prop
0 Isabella F 22731 0.012923
1 Sophia F 20477 0.011641
2 Emma F 17179 0.009766
3 Olivia F 16860 0.009585
4 Ava F 15300 0.008698
5 Emily F 14172 0.008057
6 Abigail F 14124 0.008030
7 Madison F 13070 0.007430
8 Chloe F 11656 0.006626
9 Mia F 10541 0.005993
10 Addison F 10253 0.005829
11 Elizabeth F 10135 0.005762
12 Ella F 9796 0.005569
13 Natalie F 8715 0.004954
14 Samantha F 8334 0.004738
15 Alexis F 8181 0.004651
16 Lily F 7900 0.004491
17 Grace F 7598 0.004319
18 Hailey F 6969 0.003962
19 Alyssa F 6934 0.003942
20 Lillian F 6898 0.003922
21 Hannah F 6891 0.003918
22 Avery F 6633 0.003771
23 Leah F 6474 0.003680
24 Nevaeh F 6345 0.003607
25 Sofia F 6282 0.003571
26 Ashley F 6276 0.003568
27 Anna F 6242 0.003549
28 Brianna F 6224 0.003538
29 Sarah F 6223 0.003538
... ... ... ... ...
33808 Zaviyon M 5 0.000003
33809 Zaybrien M 5 0.000003
33810 Zayshawn M 5 0.000003
33811 Zayyan M 5 0.000003
33812 Zeal M 5 0.000003
33813 Zealan M 5 0.000003
33814 Zecharia M 5 0.000003
33815 Zeferino M 5 0.000003
33816 Zekariah M 5 0.000003
33817 Zeki M 5 0.000003
33818 Zeriah M 5 0.000003
33819 Zeshan M 5 0.000003
33820 Zhyier M 5 0.000003
33821 Zildjian M 5 0.000003
33822 Zinn M 5 0.000003
33823 Zishan M 5 0.000003
33824 Ziven M 5 0.000003
33825 Zmari M 5 0.000003
33826 Zoren M 5 0.000003
33827 Zuhaib M 5 0.000003
33828 Zyeire M 5 0.000003
33829 Zygmunt M 5 0.000003
33830 Zykerion M 5 0.000003
33831 Zylar M 5 0.000003
33832 Zylin M 5 0.000003
33833 Zymaire M 5 0.000003
33834 Zyonne M 5 0.000003
33835 Zyquarius M 5 0.000003
33836 Zyran M 5 0.000003
33837 Zzyzx M 5 0.000003

33838 rows × 4 columns
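
The same per-group proportion can also be computed without a helper function. This is a sketch of one alternative, not the notebook's method, assuming Python 3 and a recent pandas; the four rows are taken from the output above, so the proportions are relative to this small frame only, and df and group_totals are illustrative names.

```python
import pandas as pd

# Small frame built from rows shown above; not the full data set.
df = pd.DataFrame({
    'name': ['Isabella', 'Sophia', 'Jacob', 'Ethan'],
    'sex': ['F', 'F', 'M', 'M'],
    'births': [22731, 20477, 21875, 17866],
})

# transform('sum') returns each row's group total, aligned with the
# original index, so the division is a plain element-wise operation.
group_totals = df.groupby('sex')['births'].transform('sum')
df['prop'] = df['births'] / group_totals
print(df)
```

Because transform keeps the original index, the result can be assigned back as a column directly, with no change to the shape or order of the frame.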

Verify that the prop column sums to 1 within each group


In [7]:
import numpy as np

np.allclose(names2010.groupby(['sex']).prop.sum(), 1)


Out[7]:
True
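
The check uses np.allclose rather than an exact comparison because sums of floating-point values are generally not exactly equal to 1. A small sketch of the difference, assuming Python 3; the variable names are illustrative.

```python
import numpy as np

# Ten copies of 0.1 should add up to 1, but 0.1 has no exact binary
# representation, so rounding error accumulates in the running sum.
total = sum([0.1] * 10)

exact_match = (total == 1.0)          # False on IEEE 754 doubles
close_match = np.isclose(total, 1.0)  # True: equal within a small tolerance

print(total, exact_match, close_match)
```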

Extract a subset of the data with the top 10 names for each sex


In [8]:
def get_top10(group):
    # sort_values replaces the deprecated sort_index(by=...)
    return group.sort_values(by='births', ascending=False)[:10]
grouped = names2010.groupby(['sex'])
top10 = grouped.apply(get_top10)

In [9]:
top10.index = np.arange(len(top10))

In [10]:
top10


Out[10]:
name sex births prop
0 Isabella F 22731 0.012923
1 Sophia F 20477 0.011641
2 Emma F 17179 0.009766
3 Olivia F 16860 0.009585
4 Ava F 15300 0.008698
5 Emily F 14172 0.008057
6 Abigail F 14124 0.008030
7 Madison F 13070 0.007430
8 Chloe F 11656 0.006626
9 Mia F 10541 0.005993
10 Jacob M 21875 0.011523
11 Ethan M 17866 0.009411
12 Michael M 17133 0.009025
13 Jayden M 17030 0.008971
14 William M 16870 0.008887
15 Alexander M 16634 0.008762
16 Noah M 16281 0.008576
17 Daniel M 15679 0.008259
18 Aiden M 15403 0.008114
19 Anthony M 15364 0.008093
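
A top-N-per-group selection can also be written without apply, by sorting once and taking the head of each group. This is a sketch under the assumption of Python 3 and a recent pandas; the six rows come from the output above, and the cutoff n = 2 is a hypothetical value chosen to fit the tiny example (the notebook uses 10).

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Isabella', 'Sophia', 'Emma', 'Jacob', 'Ethan', 'Michael'],
    'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
    'births': [22731, 20477, 17179, 21875, 17866, 17133],
})

n = 2  # hypothetical cutoff for this small example

# Sort by sex, then by births descending; head(n) then keeps the first
# n rows of each group in that sorted order.
top_n = (df.sort_values(['sex', 'births'], ascending=[True, False])
           .groupby('sex')
           .head(n)
           .reset_index(drop=True))
print(top_n)
```

The final reset_index(drop=True) plays the same role as assigning np.arange(len(top10)) to the index.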

Aggregate all births by the first letter of the name column


In [11]:
# extract first letter from name column
get_first_letter = lambda x: x[0]
first_letters = names2010.name.map(get_first_letter)
first_letters.name = 'first_letter'

table = names2010.pivot_table('births', index=first_letters,
                              columns=['sex'], aggfunc='sum')

In [12]:
table.head()


Out[12]:
sex F M
first_letter
A 309608 198870
B 64191 108460
C 96780 168356
D 47211 123298
E 118824 102513
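
The first-letter aggregation can equivalently be expressed with the .str accessor and a groupby followed by unstack. This is a sketch of that alternative, assuming Python 3 and a recent pandas; the five rows are taken from the earlier output, so the totals cover only those names, and df is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ava', 'Abigail', 'Emma', 'Aiden', 'Ethan'],
    'sex': ['F', 'F', 'F', 'M', 'M'],
    'births': [15300, 14124, 17179, 15403, 17866],
})

# .str[0] takes the first character of every name in one vectorized step.
first_letter = df['name'].str[0].rename('first_letter')

# Group by (first letter, sex), sum the births, then move sex into columns.
table = (df.groupby([first_letter, 'sex'])['births']
           .sum()
           .unstack('sex'))
print(table)
```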

Normalize the table so that each column sums to 1


In [13]:
table.sum()


Out[13]:
sex
F    1759010
M    1898382
dtype: int64

In [14]:
letter_prop = table / table.sum().astype(float)
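
The division works because table.sum() returns a Series indexed by the column labels, and pandas aligns it with the DataFrame's columns before dividing. A minimal sketch of that alignment, assuming Python 3; the counts below are made-up round numbers for illustration, not values from the data set.

```python
import pandas as pd

# Made-up counts, chosen so the proportions are easy to check by hand.
table = pd.DataFrame({'F': [30, 10], 'M': [15, 45]}, index=['A', 'B'])
table.index.name = 'first_letter'

col_totals = table.sum()          # Series with index ['F', 'M']
letter_prop = table / col_totals  # each column divided by its own total

print(letter_prop)
print(letter_prop.sum())
```

Each column of letter_prop then sums to 1, so the values read as the share of births, within one sex, of names starting with a given letter.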

Plot the proportion of boys' and girls' names starting with each letter


In [16]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop['M'].plot(kind='bar', rot=0, ax=axes[0], title='Male')
letter_prop['F'].plot(kind='bar', rot=0, ax=axes[1], title='Female',
                      legend=False)


Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1b7a77e910>